Abstract. We test a recently proposed model of commuting networks
on 80 case studies from different regions of the world (Europe and the United
States) and with geographic units of different sizes (municipality, county,
region). The model takes as input the number of commuters coming in
and out of each geographic unit and generates the matrix of commuting
flows between the geographic units. We show that the single parameter of
the model, which governs the trade-off between the influence of distance
and that of job opportunities, follows a universal law that depends only on
the average surface area of the geographic units. We verified that the law
derived from a part of the case studies yields accurate results on the other
case studies. We also show that our model significantly outperforms the two
other approaches proposing a universal commuting model (Balcan et al.,
2009; Simini et al., 2012), particularly when the geographic units are
small (e.g. municipalities).
1 Introduction

Commuting flows constitute the circulatory system of modern societies:
millions of people move every day from home to workplace and generate a
network of socio-economic relationships linking municipalities, counties or
regions. These networks are the vector of several social and economic dynamics
such as epidemic outbreaks, information flows, city development and traffic
(Ortúzar and Willumsen, 2011; Balcan et al., 2009). Understanding their
essential properties and reproducing them accurately is therefore a crucial issue
for public health institutions, policy makers, urban developers, infrastructure
planners, etc. (De Montis et al., 2007, 2010).

In the abundant literature devoted to this challenge (see Barthelemy, 2011;
Rouwendal and Nijkamp, 2004, for reviews), the intuition of a law inspired by
gravitational attraction is widely accepted (Wilson, 1998; Choukroun, 1975):
the number of commuters between two geographic units (cities, counties,
regions...) is proportional to the product of the "masses" of each geographic unit
(the population for example) and inversely proportional to a function of the
distance between them. Unfortunately, numerous experiments have shown that the
shape of the function of the distance and the basic parameter(s) of the model
must be fixed in an ad hoc manner for each case study (de Vries et al., 2009;
De Montis et al., 2007, 2010; Fotheringham, 1981). Therefore, this method cannot
be used to generate commuting networks when calibration data are lacking.

In this paper, we show that a universal law governs the single parameter of a
recently proposed model (Gargiulo et al., 2012; Lenormand et al., 2012), which
differs from the usual gravity law models in two main respects:

- It takes as input the total number of commuters in and out of each
  geographic unit, instead of the population used in the usual gravity law
  models. It is hence more data demanding, but these data are widely available.
  From these data, the model reconstructs the whole network of flows between
  the geographic units.

- It builds the network progressively, dispatching the commuters one by one
  into the different flows, and it updates the numbers of virtual commuters in
  and out of each geographic unit after each virtual commuter's choice. This
  update ensures that the generated numbers of virtual commuters in and out of
  each unit are the same as the ones given by the observed data. The individual
  flow allocation follows a probability which increases with the number of
  commuters coming into the destination and decreases with the distance between
  the considered geographic units.

We test this model on 80 case studies with geographic units of different sizes
(for example, within the same area the geographic unit can be the municipality,
the canton or the département; see Fig. 1): Czech Republic (municipality scale,
1 region), France (municipality scale, 34 regions), France (canton scale,
14 regions + all France), France (département level, all France), Italy
(municipality level, 10 regions), Italy (province level, 4 regions), USA (county
level, 14 regions + all USA). We show that the single parameter of our model
follows a simple universal law that depends only on the average area of the
considered geographic units. This implies that, given the number of commuters in
and out of each geographic unit and the average surface area of the units, we can
derive the whole matrix of flows with high confidence.

Two other approaches (Balcan et al., 2009; Simini et al., 2012) can generate 
commuting networks only from population and job data. We show that our 
approach yields significantly more accurate results, especially when considering 
small geographic units (municipalities). 



2 A simple model 

The basic factors structuring the most commonly applied model of commuting
networks, the "doubly-constrained" model (Wilson, 1998; Choukroun, 1975),
include the numbers of out-commuters and in-commuters of the geographic units
and the distances between these units. The idea behind this choice is that
individuals choose a work location by taking into account the job offers and the
distance to this work location. The distance is particularly important for
everyday commuting, which is the most frequent case. We keep this basic setup,
without adding any ingredient about the job market characteristics (professions,
salary range, etc.). We propose a simple individual-based procedure that
allocates the individuals one by one to the different flows between geographic
units, according to a probability that is inspired by gravitational models and
that is updated after each allocation. More precisely, the probability for an
individual living in unit $u_\lambda$ to work in unit $u_i$ is given by:

$$P_{\lambda \to i} = \frac{s_i^{in} \, e^{-\beta D_{\lambda i}}}{\sum_{k} s_k^{in} \, e^{-\beta D_{\lambda k}}} \qquad (1)$$

where $s_i^{in}$ is the number of commuters entering unit $u_i$, the sum in the
denominator runs over all the units where the individual can work, and
$D_{\lambda i}$ is the Euclidean distance in meters between units $u_\lambda$
and $u_i$ (computable from the Lambert or GIS coordinates). These data are
available from national statistical offices (see the appendix Datasets for more
details), as is $s_\lambda^{out}$, the number of commuters going out from unit
$u_\lambda$. We choose a probability decreasing exponentially with the distance,
in accordance with the investigations carried out in Lenormand et al. (2012) and
with the literature on commuting network models. The impact of the distance is
embedded in parameter $\beta$: for $\beta \to 0$ the probability is independent
of the distance, while for high values of $\beta$ the probability tends to zero
very rapidly as the distance increases, independently of the job offer.

Fig. 1. Different sizes of unit for the French region Auvergne

We now describe the procedure in more detail. The individuals live in a
geographical area composed of $n$ territorial units, $u_\lambda \in U$ with
$\lambda \in \{1, \dots, n\}$, among which we want to generate the commuting
network. A significant share of these individuals can work outside the $n$
units, especially those living close to the border of the area. To reduce this
border effect (see Lenormand et al., 2012), we consider that the job-search
basin is an extended (EXT) area, composed of the $n$ residential units and $m$
units surrounding the area (thus, we have $N_{tot} = n + m$ units in total,
$u_i \in U^{EXT}$ with $i \in \{1, \dots, N_{tot}\}$). The algorithm simulates
individual searches for workplaces. At each time step we select a unit
$u_\lambda$ at random among the residence units and one of its $s_\lambda^{out}$
available commuters. We draw at random the workplace $u_i$ of this individual
according to the probabilities $P_{\lambda \to i}$. Then we decrement
$s_\lambda^{out}$ and $s_i^{in}$ by one. Note that decrementing $s_i^{in}$ and
$s_\lambda^{out}$ at each step significantly complicates the derivation of an
analytical expression of the model. The generated network is saved in a matrix
$W \in M_{n \times N_{tot}}(\mathbb{N})$, where each entry $W_{\lambda i}$
represents the number of commuters between units $u_\lambda \in U$ and
$u_i \in U^{EXT}$. The algorithm is summarized in Fig. 2.
Fig. 2. Algorithm describing the network generation model
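The allocation procedure of this section can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation used for the results of the paper: the function name `generate_network`, its arguments and the use of NumPy are hypothetical choices; the sketch does not forbid an individual from working in his or her own residence unit (the text does not specify this point); and it assumes that the total number of in-commuters is at least the total number of out-commuters, so that a job always remains available.

```python
import numpy as np


def generate_network(s_out, s_in, dist, beta, rng=None):
    """Allocate commuters one by one and return the matrix W (n x N_tot).

    s_out : remaining out-commuters of the n residence units
    s_in  : remaining in-commuters of the N_tot units of the extended area
    dist  : (n, N_tot) array of distances in meters
    beta  : distance-decay parameter of the allocation probability
    """
    rng = np.random.default_rng() if rng is None else rng
    s_out = np.array(s_out, dtype=np.int64)  # copies, decremented below
    s_in = np.array(s_in, dtype=np.int64)
    dist = np.asarray(dist, dtype=float)
    n, n_tot = dist.shape
    if s_out.sum() > s_in.sum():
        raise ValueError("not enough in-commuters to place every out-commuter")

    decay = np.exp(-beta * dist)             # exp(-beta * D), computed once
    W = np.zeros((n, n_tot), dtype=np.int64)
    remaining = int(s_out.sum())
    while remaining > 0:
        # Pick a residence unit at random among those with commuters left.
        lam = rng.choice(np.flatnonzero(s_out > 0))
        # Probability proportional to s_in * exp(-beta * D), with the
        # current (already decremented) values of s_in.
        weights = s_in * decay[lam]
        total = weights.sum()
        if total <= 0:
            raise RuntimeError("no reachable job left for this commuter")
        i = rng.choice(n_tot, p=weights / total)
        W[lam, i] += 1
        s_out[lam] -= 1
        s_in[i] -= 1
        remaining -= 1
    return W
```

By construction, the row sums of the returned matrix equal the input numbers of out-commuters; when the two totals coincide, the column sums also equal the input numbers of in-commuters.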



3 A universal law ruling the parameter $\beta$

We calibrated parameter $\beta$ by minimizing the Kolmogorov-Smirnov (KS)
distance between the observed and simulated distributions of commuting
distances. We consider that the distance distribution is an essential feature
that the model should reproduce. We checked this choice using the common part of
commuters (CPC), based on the Sørensen index (Sørensen, 1948), which quantifies
the similarity between the observed and simulated networks. Basically, the CPC
measures the share of the commuting flows that is correctly reproduced, on
average, by the simulated network. The indicator varies between 0, when no
agreement is found, and 1, when the two networks are identical. We verified that
the value of $\beta$ that minimises the KS distance also maximises the CPC (see
Gargiulo et al., 2012; Lenormand et al., 2012, and the appendix Statistical
Tools for details). Moreover, we observed that the corresponding value of the
CPC is always higher than 0.70, with an average around 0.8 over all the case
studies, which shows a high similarity of the networks (see Fig. 4).

Fig. 3. Log-log scatter plot of the calibrated $\beta$ values as a function of
the average unit area (in km²) for the 80 regions; the line represents the
regression line predicting the $\beta$ value.

The next question is: how does the value of $\beta$ vary with the global
characteristics of the case study? Our results show that the optimal value of
parameter $\beta$ follows a universal rule depending only on the average surface
area of the geographic units. This rule is shown in Fig. 3, where the x-axis
represents the average surface area of the geographic units in the area
($\langle S \rangle$, in logarithmic scale) and the y-axis the optimal $\beta$
value (in logarithmic scale). The linear regression in the log-log plane shows a
simple relation:

$$\beta = \alpha \, \langle S \rangle^{-\nu} \qquad (2)$$

with $\alpha = 0.000315$ and $\nu = 0.177$. The high value of the adjusted
$R^2 = 0.92$ confirms the quality of the fit. We observe that $\beta$ decreases
with the average surface area of the units $\langle S \rangle$, meaning that,
when $\langle S \rangle$ is small (e.g. for French municipalities), the distance
is more important in the commuting choice than when $\langle S \rangle$ is large
(e.g. for regions or counties).
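As a worked illustration of the relation above, the following Python sketch evaluates the power law with the reported values $\alpha = 0.000315$ and $\nu = 0.177$. The function name `beta_from_mean_area` is a hypothetical choice; the average area is assumed to be expressed in km², as on the x-axis of Fig. 3, and the resulting $\beta$ applies to distances in meters, as in Section 2.

```python
ALPHA = 3.15e-4  # reported prefactor of the power law
NU = 0.177       # reported exponent of the power law


def beta_from_mean_area(mean_area_km2, alpha=ALPHA, nu=NU):
    """Return beta = alpha * <S>^(-nu) for an average unit area in km^2."""
    if mean_area_km2 <= 0:
        raise ValueError("the average unit area must be positive")
    return alpha * mean_area_km2 ** (-nu)
```

For instance, an average unit area of 19.86 km² (the value listed for FR1 in Table 1) gives a $\beta$ of roughly 1.9e-4, while an average area of 1 km² gives back $\alpha$ itself.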

We use a cross-validation method to test the robustness of our estimation
of the $\alpha$ and $\nu$ values and to evaluate whether they can be used to
generate commuting networks in new case studies. The dataset (including 80 case
studies) is randomly cut into two sets, called the training set (composed of 53
areas) and the testing set (composed of 27 areas). We use the training set to
build a regression model giving the estimates of $\alpha$ and $\nu$. From these
estimates and from equation 2, we compute $\beta$ for the 27 regions of the
testing set. We repeat the cross-validation process 10,000 times, obtaining
about 3500 $\beta$ estimations for each case study. Then we calculate the CPC
for the value of $\beta$ calibrated on the data (with the KS distance) and for
each of the 3500 values obtained by the cross-validation method. Fig. 4 shows,
for each case study, the CPC associated with the calibrated $\beta$, the average
CPC obtained with the $\beta$ values estimated from the cross-validation and the
interval defined by the minimum and the maximum values (which is too small to be
seen in most cases). The CPC obtained with the calibrated $\beta$ value (black
triangle) is almost the same as the average CPC obtained with the estimated
$\beta$ value (red square). We can observe that the average CPC obtained with
the estimated $\beta$ value is, for some areas, higher than the CPC obtained
with the calibrated $\beta$ value. This is possible because the CPC is not the
calibration criterion: the calibrated $\beta$ minimises the KS distance, so
another $\beta$ value can yield a slightly better CPC. Globally, we can conclude
that the $\beta$ values obtained with the log-linear model lead to the same
values of the CPC indicator as the calibrated values. The method therefore
appears fairly robust and can be used in other case studies with high
confidence.

Fig. 4. Common part of commuters for the 80 regions. The squares represent the
CPC obtained with the calibrated $\beta$ value. The triangles represent the
average CPC of the virtual networks built with the $\beta$ values estimated by
the cross-validation procedure (over about 3500 estimated $\beta$ values for
each region); dark bars represent the minimum and the maximum CPC obtained from
the networks built with the estimated $\beta$, but in most cases they are too
close to the average to be seen.
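The cross-validation loop described above can be sketched as follows. This is an illustrative sketch only: the function names `fit_power_law` and `cross_validate` are hypothetical, the regression is assumed to be an ordinary least-squares fit in the log-log plane (computed here with `numpy.polyfit`), and the calibrated $\beta$ values and average areas must be supplied by the user.

```python
import numpy as np


def fit_power_law(areas, betas):
    """Least-squares fit of log(beta) = log(alpha) - nu * log(area)."""
    slope, intercept = np.polyfit(np.log(areas), np.log(betas), 1)
    return np.exp(intercept), -slope  # (alpha, nu)


def cross_validate(areas, betas, n_train=53, n_rep=10000, rng=None):
    """Return, for each case study, the list of beta values estimated
    in the repetitions where it fell into the testing set."""
    rng = np.random.default_rng() if rng is None else rng
    areas = np.asarray(areas, dtype=float)
    betas = np.asarray(betas, dtype=float)
    n = len(areas)
    estimates = [[] for _ in range(n)]
    for _ in range(n_rep):
        perm = rng.permutation(n)                 # random split of the cases
        train, test = perm[:n_train], perm[n_train:]
        alpha, nu = fit_power_law(areas[train], betas[train])
        for k in test:
            estimates[k].append(alpha * areas[k] ** (-nu))
    return estimates
```

With 80 case studies, 27 of which are in the testing set at each repetition, each case study is tested on average in 10,000 × 27/80 ≈ 3,375 of the repetitions, which is consistent with the "about 3500" estimations per case study mentioned above.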

4 Discussion 

We now discuss the interest of our proposal in comparison with two other
important studies, Balcan et al. (2009) and Simini et al. (2012).

The objective of Balcan et al. (2009) is to generate a worldwide commuting
network, and the model must deal with the wide variety of populations and
surface areas of the geographic units for which the data are available. To solve
this difficulty, Balcan et al. (2009) project these data on ad hoc units defined
with a Voronoi diagram. They define their basic unit as a cell approximately
equivalent to a rectangle of 25 x 25 kilometers along the Equator. This allows
them to calibrate their model because a unit is the same object whatever the
country. This is an interesting solution for generating a worldwide commuting
network, but it leads to an average commuting distance of 250 km, which is much
larger than the average distance of daily commuting (51 km in the US, 28 km in
the UK and less in most of the other European countries). We expect our approach
to take better account of the heterogeneities in the geographic units.

The radiation model, proposed in Simini et al. (2012), is a universal approach
for generating commuting networks: the commuting flow between two municipalities
is a function of the cumulative job opportunities at the distance between
the two municipalities. The model has an elegant analytical solution and the
average flow $\langle T_{ij} \rangle$ from unit $i$ to unit $j$ can be
approximated by

$$\langle T_{ij} \rangle = m_i \left( \frac{N_c}{N} \right) \frac{m_i \, n_j}{(m_i + s_{ij})(m_i + n_j + s_{ij})} \qquad (3)$$

where $m_i$ and $n_j$ are respectively the populations of units $u_i$ and
$u_j$, $N_c$ is the total number of commuters and $N$ is the total population in
the case-study region, and $s_{ij}$ is the total population in the circle of
radius $r_{ij}$ centred at $u_i$ (excluding the source and destination
populations). We implemented their analytical approximation and reproduced the
graphs presented in their paper. Fig. 5 shows the comparison between the
radiation model and ours in the US for inter-county commuting and in the French
Auvergne region for inter-municipality commuting. We observe that in both cases
our approach yields significantly better results. In particular, the CPC of the
radiation model for inter-municipality commuting in Auvergne is 0.3, which
indicates a poor match with the data. To be fair, it should be recalled that our
model uses more specific data (total number of commuters in and out of each
geographic unit) than the radiation model, hence one could expect our results to
be more accurate.
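The analytical approximation of equation 3 can be evaluated with the short Python sketch below. It is only an illustration and not the code used to produce Fig. 5: the function name `radiation_flows` and its arguments are hypothetical, the radius $r_{ij}$ is taken as the Euclidean distance between the coordinates of the two units (as in Simini et al., 2012), and units lying exactly at distance $r_{ij}$ are counted inside the circle.

```python
import numpy as np


def radiation_flows(pop, coords, commuter_fraction):
    """Average flows <T_ij> of the radiation model.

    pop               : population of each unit (used for both m_i and n_j)
    coords            : (n, 2) array of unit coordinates
    commuter_fraction : N_c / N, share of commuters in the total population
    """
    pop = np.asarray(pop, dtype=float)
    coords = np.asarray(coords, dtype=float)
    n = len(pop)
    # Pairwise Euclidean distances between units.
    dist = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    T = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            # Population within the circle of radius r_ij centred on i,
            # excluding the source i and the destination j.
            inside = dist[i] <= dist[i, j]
            s_ij = pop[inside].sum() - pop[i] - pop[j]
            T[i, j] = (commuter_fraction * pop[i] * pop[i] * pop[j]
                       / ((pop[i] + s_ij) * (pop[i] + pop[j] + s_ij)))
    return T
```

The double loop is quadratic in the number of units and recomputes the enclosed population for each pair; this keeps the sketch close to the formula but would need to be optimised for several thousand units.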

5 Conclusions 

We propose a universal model of commuting networks in which each individual
chooses a place of work according to the principles of the gravity law: the
attraction of a possible place of work is a function of its "approximated" or
real job opportunities and of its distance from the place of residence. We
generate the virtual commuting network for the residents of the units composing
a case-study region. Following this individual decision function, a heuristic
matching is performed between the available jobs of the various units (defined
by the data on the in-commuters of each unit) and the job seekers living in each
unit (defined by the data on the out-commuters of each unit). We show that this
model is relevant whatever the unit size. In particular, it is much more
relevant than the other universal approaches, since it allows building commuting
networks between units of small size. This is better suited to describing
everyday commuting, which mostly corresponds to short distances. Moreover, the
stochastic nature of our model avoids treating small flows, especially those at
short distance to a small unit, as deterministic. Once again, this last property
is very relevant for virtual commuting among small units, while at the same time
being informative about the confidence interval for large flows.
Fig. 5. Comparing the predictions of the radiation model with ours. Plots (a)-(c): US at
county level; plots (d)-(f): Auvergne region (France) at municipality level. Plots (a),
(b), (d), (e): comparison between the measured flows and the generated flows. Grey
points are the scatter plot for each pair of counties. The black circles represent the
average number of generated travelers in that bin. (a) and (d) plot the radiation model
while (b) and (e) plot our model. Plots (c) and (f): commuting distance distributions of
the US (c) and Auvergne (f); the blue line represents the observed data, the red one the
results of our model and the green one the results of the radiation model.
6 Appendix: Datasets

Commuting data are usually provided by statistical offices in the form of
origin-destination tables. We analyzed 80 regions from 7 different datasets. In
these descriptions, the outside consists of the units in $U^{EXT}$ but not in $U$.

6.1 Czech Republic dataset at the municipality scale

This dataset is composed of the number of commuters between each pair of
municipalities of South Moravia (a Czech Republic region)¹. With this dataset
we have built 1 region and its outside. The outside is composed of all the units
of the Czech Republic except the ones belonging to the region. The region is
identified as CZ.

6.2 French dataset at the municipality scale 

This dataset is composed of the number of commuters between each pair of
municipalities of France. The distance used is the Euclidean distance computed
with the Lambert coordinates. With this dataset we have built 34 regions (French
regions or districts) and their outside, composed of the neighboring French
districts. This dataset was collected for the 1999 French Census by the French
Statistical Institute, INSEE. The data were kindly made available by the Maurice
Halbwachs Center. These regions are identified from FR1 to FR34.

6.3 French dataset at the "canton" scale 

This dataset is the same as the previous one at the "canton" scale (larger
surface area than the municipality). The distance used is the Euclidean distance
computed with the latitude and longitude. We used the longitude and latitude to
build 14 regions and their outside. The outside is composed of all the units
except those which compose the region. These regions are identified from FRc1 to
FRc14. We also used the complete French network (denoted FRc0) without outside.

6.4 French dataset at the "département" scale

This dataset is the same as the previous one at the "département" scale (larger
surface area than the municipality and the "canton"). The distance used is the
Euclidean distance computed with the latitude and longitude. We built the
complete French network (denoted FRd0) without outside.

6.5 Italy dataset at the municipality scale 

This dataset is composed of the number of commuters between each pair of
municipalities of Italy and the latitude and longitude of each municipality. We
used the longitude and latitude to build 10 regions and their outside. The outside
is composed of municipalities within a reasonable distance of the border of the
region. These regions are identified from IT1 to IT10.

¹ Data are available online at http://www.czso.cz/xb/edicniplan.nsf/publ/13-6231-04-



6.6 Italy dataset at provincial level 

This dataset is the same as the previous one at the "provincia" scale (larger
than the municipality). We used the longitude and latitude to build 4 regions and
their outside. The outside is composed of all the units except those which compose
the region. These regions are identified from ITp1 to ITp4. We also used the
complete Italian network (denoted ITp0) without outside.

6.7 United States of America dataset at the county scale

This dataset is composed of the number of commuters between each pair
of counties of the USA² and the latitude and longitude of each county³. We used
the longitude and latitude to build 14 regions and their outside. The outside is
composed of all the units except those which compose the region. These regions
are identified from USA1 to USA14. We also used the complete USA network
(denoted USA0) without outside.



² Available online at http://www.census.gov/population/www/cen2000/commuting/index.html
³ Available online at http://www.census.gov/geo/www/gazetteer/places2k.html



Table 1. Description of the regions

Region   Number of  Number of  Area     Average    Standard    Number of  Type of unit
         units      units      (km²)    unit area  deviation   commuters
         (region)   (outside)           (km²)      unit area
                                                   (km²)
CZ       43         630        35369    822.54     703.23      13309      Municipality
FR1      1310       3463       26013    19.86      12.49       295776     Municipality
FR2      1269       1447       27208    21.44      16.14       653710     Municipality
FR3      419        2809       5762     13.75      8.46        162370     Municipality
FR4      903        3081       8280     9.17       9.55        440961     Municipality
FR5      2296       2835       41309    17.99      21.30       700452     Municipality
FR6      261        3124       5175     19.83      10.46       69915      Municipality
FR7      185        1859       5167     27.93      18.71       12273      Municipality
FR8      1464       2467       25810    17.63      12.94       375363     Municipality
FR9      1842       4718       39151    21.25      14.76       624693     Municipality
FR10     3020       3845       45348    15.02      15.74       546162     Municipality
FR11     747        3169       16942    22.68      14.15       139481     Municipality
FR12     1786       3317       16202    9.07       7.46        268399     Municipality
FR13     1420       3536       12317    8.67       5.64        469335     Municipality
FR14     433        3914       6211     14.34      12.41       42690      Municipality
FR15     515        3808       5874     11.41      9.54        92053
FR16     2339
FR17     260
FR18     1545
FR19     1948
FR20     36
FR21     262
FR22     185
FR23     47
FR24     377
FR25     195
FR26     547
FR27     163
FR28     327
FR29     102
FR30     40
FR31     196
FR32     463
FR33     433
FR34     286
FRc0     3146
FRc1     1062
FRc2     523
FRc3     226
FRc4     160
FRc5     55
FRc6     869
FRc7     2088
FRc8     100
FRc9     600
FRc10    302
FRc11    906
FRc12    1500
FRc13    32
FRc14    506
FRd0     94
IT1      377
IT2      395
IT3      1002
IT4      201
IT5      204
IT6      51
IT7      2000
IT8      186
IT9      1510
IT10     705
ITp0     99
ITp1     50
ITp2     30
ITp3     20
USA0     3108
USA1     1015
USA2     103
USA3     54
USA4     2011
USA5     202
USA6     504
USA7     806
USA8     352
USA9     1507
USA10    13
USA11    32
USA12    1004
USA13    207
USA14    301
7 Appendix: Statistical tools

7.1 Calibration 

In each case study area we build a normalized histogram $P(d)$ describing the
probability that a commuter travels a given distance $d$. This histogram shows
a typical log-normal shape in all the studied areas, with a peak varying from
case to case. Each $\beta$ value produces a different distance histogram: low
values of $\beta$ generate uniform distance distributions, while high values
give exponentially decreasing structures.

To obtain the $\beta$ value for each case study, we minimize the
Kolmogorov-Smirnov distance between the histograms of the observed and the
simulated data:

$$D_{KS}(\beta) = \sup_{d} \left| P^{obs}(d) - P^{sim}(d, \beta) \right| \qquad (4)$$

As we can observe in Fig. 6, the Kolmogorov-Smirnov distance presents a
clear minimum at an optimal $\beta$ value, which differs from one area to
another.
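The calibration step can be sketched in Python as a grid search over candidate $\beta$ values. The sketch rests on assumptions that are not in the text above: the names `ks_distance` and `calibrate_beta` are hypothetical, the KS distance is computed on empirical cumulative distributions of the commuting distances (the usual form of the two-sample statistic), and `simulate_distances` stands for any user-supplied function returning the commuting distances of a network simulated with a given $\beta$.

```python
import numpy as np


def ks_distance(obs, sim):
    """Two-sample Kolmogorov-Smirnov distance between two sets of
    commuting distances, computed on their empirical CDFs."""
    obs = np.sort(np.asarray(obs, dtype=float))
    sim = np.sort(np.asarray(sim, dtype=float))
    grid = np.concatenate([obs, sim])  # the supremum is reached on a sample point
    cdf_obs = np.searchsorted(obs, grid, side="right") / len(obs)
    cdf_sim = np.searchsorted(sim, grid, side="right") / len(sim)
    return float(np.max(np.abs(cdf_obs - cdf_sim)))


def calibrate_beta(obs_distances, simulate_distances, betas):
    """Return the candidate beta minimising the KS distance, together
    with the KS distance obtained for every candidate."""
    scores = [ks_distance(obs_distances, simulate_distances(b)) for b in betas]
    return betas[int(np.argmin(scores))], scores
```

Since the model is stochastic, in practice `simulate_distances` would average over several replicas of the generation process for each candidate value; the grid of candidates is left to the user.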

7.2 Validation 

After finding the optimal value of the parameter, we must verify the ability
of the model to reproduce the data. For the validation procedure we use two
origin-destination matrices (Table 2), the observed one
$Y \in M_{(n+1) \times (n+1)}(\mathbb{N})$ and the simulated one
$\tilde{Y} \in M_{(n+1) \times (n+1)}(\mathbb{N})$. $\tilde{Y}$ can easily be
obtained by difference from the total number of in-commuters
$(s_i^{in})_{1 \le i \le N_{tot}}$, the total number of out-commuters
$(s_\lambda^{out})_{1 \le \lambda \le n}$ and the light grey part of Table 3,
which corresponds to $W$. To compare $Y$ and $\tilde{Y}$ we use as statistical
indicator the Sørensen similarity index, an indicator usually used to evaluate
the similarity of content of different samples in ecological problems. In our
specific case we call the index "common part of commuters" and we define it in
the following way:

$$CPC(Y, \tilde{Y}) = \frac{2 \, NCC(Y, \tilde{Y})}{NC(Y) + NC(\tilde{Y})} \qquad (5)$$

where $NCC(Y, \tilde{Y})$ is the number of common commuters between the two
sets:

$$NCC(Y, \tilde{Y}) = \sum_{i=1}^{n+1} \sum_{j=1}^{n+1} \min(Y_{ij}, \tilde{Y}_{ij}) \qquad (6)$$

$NC(Y)$ and $NC(\tilde{Y})$ are respectively the numbers of commuters in the
observed and simulated sets:

$$NC(Y) = \sum_{i=1}^{n+1} \sum_{j=1}^{n+1} Y_{ij}, \qquad NC(\tilde{Y}) = \sum_{i=1}^{n+1} \sum_{j=1}^{n+1} \tilde{Y}_{ij} \qquad (7)$$

This indicator varies between 0, if the simulated values never reproduce the
observed ones, and 1, if the agreement is perfect.
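The indicator defined above translates directly into a few lines of Python. The function name `common_part_of_commuters` is a hypothetical choice, and both origin-destination matrices are assumed to be given as arrays of the same shape.

```python
import numpy as np


def common_part_of_commuters(observed, simulated):
    """CPC = 2 * NCC / (NC(observed) + NC(simulated)), where NCC is the
    sum of the element-wise minima of the two matrices."""
    y = np.asarray(observed, dtype=float)
    y_sim = np.asarray(simulated, dtype=float)
    if y.shape != y_sim.shape:
        raise ValueError("the two matrices must have the same shape")
    total = y.sum() + y_sim.sum()
    if total == 0:
        raise ValueError("both matrices are empty")
    return 2.0 * np.minimum(y, y_sim).sum() / total
```

As a small worked example, for an observed matrix [[0, 3], [2, 0]] and a simulated matrix [[0, 1], [4, 0]], the common commuters are 1 + 2 = 3 and each matrix contains 5 commuters, so the CPC is 2 × 3 / 10 = 0.6.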

In the lower plot of Fig. 6 we can observe that the model reaches the best
performance in reproducing the original results (higher Sørensen index) exactly
in ...
Fig. 6. Upper plot: Kolmogorov-Smirnov distance between the real and the simulated
distance distributions as a function of $\beta$, for some case study areas. Lower plot:
Sørensen index as a function of $\beta$. Each point is the result of 100 replicas of the
generation process.



Table 2. Origin-destination table; The light grey table represents the commuters liv- 
ing and working in the region for each municipality of the region; The grey columns 
represent the out-commuters living in the region and working outside (Out.) for each 
municipality of the region; The grey line represents the in-commuters working in the 
region and living outside (Out.) for each municipality of the region; The dark grey 
line(column) represents the total number of out(in)-commuters for each municipality 
of the region. 
Table 3. Origin-destination table from the region to the region and the outside; The
light grey table represents the commuters living (place of residence RP) and working 
(place of work WP) in the region for each municipality of the region; The grey table 
represents the commuters living (place of residence RP) in the region and working 
(place of work WP) outside of the region. 